PM566 - Homework 2

Author

Chih-Chan Jessica Lan

Load Library

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.2     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.1.0     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
library(ggcorrplot)

Preprocessing

library(nycflights13)
airlines <- airlines
airports <- airports
flights <- flights
planes <- planes
weather <- weather

Question 1

Find the top most popular destinations

flights %>%
  group_by(dest) %>%
  count(sort = T) %>%
  top_n(10, n)
# A tibble: 105 × 2
# Groups:   dest [105]
   dest      n
   <chr> <int>
 1 ORD   17283
 2 ATL   17215
 3 LAX   16174
 4 BOS   15508
 5 MCO   14082
 6 CLT   14064
 7 SFO   13331
 8 FLL   12055
 9 MIA   11728
10 DCA    9705
# ℹ 95 more rows

Ans: From all flights in 2013 departing from New York City’s three main airports, the top 10 destinations are shown above. The busiest destination is Chicago O’Hare (ORD) with 17,283 flights, followed by Atlanta (ATL) with 17,215 and Los Angeles (LAX) with 16,174. The rest of the top ten include BOS, MCO, CLT, SFO, FLL, MIA, and DCA, in that order.

Question 2

flights <- flights %>%
  mutate(dep_cat = case_when(dep_time >= 0  & dep_time < 600  ~ "early morning",
                             dep_time >= 600  & dep_time < 1200 ~ "morning",
                             dep_time >= 1200 & dep_time < 1800 ~ "afternoon",
                             dep_time >= 1800 & dep_time <= 2400 ~ "evening",
                             TRUE ~ NA_character_),
         arr_cat = case_when(arr_time >= 0  & arr_time < 600  ~ "early morning",
                             arr_time >= 600  & arr_time < 1200 ~ "morning",
                             arr_time >= 1200 & arr_time < 1800 ~ "afternoon",
                             arr_time >= 1800 & arr_time <= 2400 ~ "evening",
                             TRUE ~ NA_character_
                             )
         ) %>%
  mutate(red_flight = ifelse(dep_cat %in% c("afternoon", "evening") &
           arr_cat %in% c("early morning", "morning"), T, F))
category_order <- c("early morning", "morning", "afternoon", "evening")

flights %>%
  ggplot(aes(x = factor(dep_cat, levels = category_order))) +
  geom_bar(fill = "darksalmon") +
  labs(title = "Departure Time Categories", x = "Departure Category", y = "Count") +
  theme_bw()

flights %>%
  ggplot(aes(x = factor(arr_cat, levels = category_order))) +
  geom_bar(fill = "gold") +
  labs(title = "Arrival Time Categories", x = "Arrival Category", y = "Count") +
  theme_bw()

flights %>%
  summarise(
    red_eye_count = sum(red_flight, na.rm = T),
    red_eye_pct = mean(red_flight, na.rm = T) * 100
    )
# A tibble: 1 × 2
  red_eye_count red_eye_pct
          <int>       <dbl>
1         10633        3.16

Ans: About 3.15% of the flights were red-eye flights. The percentage was calculated by excluding missing values.

Question 3

flights %>%
  filter(!is.na(tailnum)) %>%
  select(tailnum, carrier) %>%
  distinct(tailnum, carrier) %>%
  group_by(tailnum) %>%
  summarise(
    n = n(),
    airlines = paste(sort(unique(carrier)), collapse = ", ")
  ) %>%
  filter(n >= 2)
# A tibble: 17 × 3
   tailnum     n airlines
   <chr>   <int> <chr>   
 1 N146PQ      2 9E, EV  
 2 N153PQ      2 9E, EV  
 3 N176PQ      2 9E, EV  
 4 N181PQ      2 9E, EV  
 5 N197PQ      2 9E, EV  
 6 N200PQ      2 9E, EV  
 7 N228PQ      2 9E, EV  
 8 N232PQ      2 9E, EV  
 9 N933AT      2 DL, FL  
10 N935AT      2 DL, FL  
11 N977AT      2 DL, FL  
12 N978AT      2 DL, FL  
13 N979AT      2 DL, FL  
14 N981AT      2 DL, FL  
15 N989AT      2 DL, FL  
16 N990AT      2 DL, FL  
17 N994AT      2 DL, FL  

Ans: There were 17 planes that flew for more than one airline. The airplane tail numbers and their corresponding airlines are listed above.

Question 4

unique(weather$origin)
[1] "EWR" "JFK" "LGA"
length(unique(airports$faa))
[1] 1458

Ans: The weather data record hourly weather conditions for each airport (origin), and the airports data provide each airport’s location in America. Therefore, the relationship between airports and weather should be shown as one-to-many, meaning each of the three airports (EWR, JFK, and LGA) is linked to many weather observations in 2013.

Question 5

weather %>%
  mutate(key = paste(year, month, day, hour, origin)) %>%
  group_by(key) %>%
  count() %>%
  filter(n > 1)
# A tibble: 3 × 2
# Groups:   key [3]
  key                 n
  <chr>           <int>
1 2013 11 3 1 EWR     2
2 2013 11 3 1 JFK     2
3 2013 11 3 1 LGA     2
weather %>%
  filter(month == 11 & day == 3 & hour == 1)
# A tibble: 6 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013    11     3     1  52.0  39.0  61.2      310       6.90        NA
2 EWR     2013    11     3     1  50    39.0  65.8      290       5.75        NA
3 JFK     2013    11     3     1  54.0  37.9  54.5      320       9.21        NA
4 JFK     2013    11     3     1  52.0  37.9  58.6      310       6.90        NA
5 LGA     2013    11     3     1  55.0  39.0  54.7      330       9.21        NA
6 LGA     2013    11     3     1  54.0  39.9  58.9      310       8.06        NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

Ans: The six duplicate records all occurred on the same date, November 3, 2013, at around 1:00 AM. This happened because of the Daylight Saving Time (DST) change, when the clock was set back from 2:00 AM to 1:00 AM. As a result, the hour “1 AM” was recorded twice for each airport, creating duplicate entries in the weather data.

Question 6

flights <- flights %>%
  left_join(weather, by = c("origin", "time_hour"))

Step 2: Check the size of the data

dim(flights)
[1] 336776     35

There are 336,776 observations and 35 variables in flights data after merged with weather data.

Step 3: Examine the variables and their types

str(flights)
tibble [336,776 × 35] (S3: tbl_df/tbl/data.frame)
 $ year.x        : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month.x       : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day.x         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour.x        : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
 $ dep_cat       : chr [1:336776] "early morning" "early morning" "early morning" "early morning" ...
 $ arr_cat       : chr [1:336776] "morning" "morning" "morning" "morning" ...
 $ red_flight    : logi [1:336776] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ year.y        : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month.y       : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day.y         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ hour.y        : int [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ temp          : num [1:336776] 39 39.9 39 39 39.9 ...
 $ dewp          : num [1:336776] 28 25 27 27 25 ...
 $ humid         : num [1:336776] 64.4 54.8 61.6 61.6 54.8 ...
 $ wind_dir      : num [1:336776] 260 250 260 260 260 260 240 260 260 260 ...
 $ wind_speed    : num [1:336776] 12.7 15 15 15 16.1 ...
 $ wind_gust     : num [1:336776] NA 21.9 NA NA 23 ...
 $ precip        : num [1:336776] 0 0 0 0 0 0 0 0 0 0 ...
 $ pressure      : num [1:336776] 1012 1011 1012 1012 1012 ...
 $ visib         : num [1:336776] 10 10 10 10 10 10 10 10 10 10 ...
summary(flights)
     year.x        month.x           day.x          dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance        hour.x     
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                     dep_cat         
 Min.   : 0.00   Min.   :2013-01-01 05:00:00   Length:336776     
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00   Class :character  
 Median :29.00   Median :2013-07-03 10:00:00   Mode  :character  
 Mean   :26.23   Mean   :2013-07-03 05:22:54                     
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00                     
 Max.   :59.00   Max.   :2013-12-31 23:00:00                     
                                                                 
   arr_cat          red_flight          year.y        month.y      
 Length:336776      Mode :logical   Min.   :2013   Min.   : 1.000  
 Class :character   FALSE:326143    1st Qu.:2013   1st Qu.: 4.000  
 Mode  :character   TRUE :10633     Median :2013   Median : 7.000  
                                    Mean   :2013   Mean   : 6.531  
                                    3rd Qu.:2013   3rd Qu.: 9.000  
                                    Max.   :2013   Max.   :12.000  
                                    NA's   :1556   NA's   :1556    
     day.y           hour.y           temp             dewp      
 Min.   : 1.00   Min.   : 1.00   Min.   : 10.94   Min.   :-9.94  
 1st Qu.: 8.00   1st Qu.: 9.00   1st Qu.: 42.08   1st Qu.:26.06  
 Median :16.00   Median :13.00   Median : 57.20   Median :42.80  
 Mean   :15.67   Mean   :13.17   Mean   : 57.00   Mean   :41.63  
 3rd Qu.:23.00   3rd Qu.:17.00   3rd Qu.: 71.96   3rd Qu.:57.92  
 Max.   :31.00   Max.   :23.00   Max.   :100.04   Max.   :78.08  
 NA's   :1556    NA's   :1556    NA's   :1573     NA's   :1573   
     humid           wind_dir       wind_speed       wind_gust     
 Min.   : 12.74   Min.   :  0.0   Min.   : 0.000   Min.   :16.11   
 1st Qu.: 43.99   1st Qu.:130.0   1st Qu.: 6.905   1st Qu.:20.71   
 Median : 57.73   Median :220.0   Median :10.357   Median :24.17   
 Mean   : 59.56   Mean   :201.5   Mean   :11.114   Mean   :25.25   
 3rd Qu.: 75.33   3rd Qu.:290.0   3rd Qu.:14.960   3rd Qu.:28.77   
 Max.   :100.00   Max.   :360.0   Max.   :42.579   Max.   :66.75   
 NA's   :1573     NA's   :9796    NA's   :1634     NA's   :256391  
     precip           pressure          visib       
 Min.   :0.00000   Min.   : 983.8   Min.   : 0.000  
 1st Qu.:0.00000   1st Qu.:1012.7   1st Qu.:10.000  
 Median :0.00000   Median :1017.5   Median :10.000  
 Mean   :0.00456   Mean   :1017.8   Mean   : 9.256  
 3rd Qu.:0.00000   3rd Qu.:1022.8   3rd Qu.:10.000  
 Max.   :1.21000   Max.   :1042.1   Max.   :10.000  
 NA's   :1556      NA's   :38788    NA's   :1556    

Step 4: Look at the top and bottom of the data

head(flights)
# A tibble: 6 × 35
  year.x month.x day.x dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int>   <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013       1     1      517            515         2      830            819
2   2013       1     1      533            529         4      850            830
3   2013       1     1      542            540         2      923            850
4   2013       1     1      544            545        -1     1004           1022
5   2013       1     1      554            600        -6      812            837
6   2013       1     1      554            558        -4      740            728
# ℹ 27 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour.x <dbl>, minute <dbl>, time_hour <dttm>, dep_cat <chr>, arr_cat <chr>,
#   red_flight <lgl>, year.y <int>, month.y <int>, day.y <int>, hour.y <int>,
#   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
#   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>
tail(flights)
# A tibble: 6 × 35
  year.x month.x day.x dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int>   <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013       9    30       NA           1842        NA       NA           2019
2   2013       9    30       NA           1455        NA       NA           1634
3   2013       9    30       NA           2200        NA       NA           2312
4   2013       9    30       NA           1210        NA       NA           1330
5   2013       9    30       NA           1159        NA       NA           1344
6   2013       9    30       NA            840        NA       NA           1020
# ℹ 27 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour.x <dbl>, minute <dbl>, time_hour <dttm>, dep_cat <chr>, arr_cat <chr>,
#   red_flight <lgl>, year.y <int>, month.y <int>, day.y <int>, hour.y <int>,
#   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
#   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>

Step 5: Visualize the distributions of key variables

summary(flights$temp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  10.94   42.08   57.20   57.00   71.96  100.04    1573 
summary(flights$dewp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -9.94   26.06   42.80   41.63   57.92   78.08    1573 
summary(flights$humid)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  12.74   43.99   57.73   59.56   75.33  100.00    1573 
summary(flights$wind_dir)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0   130.0   220.0   201.5   290.0   360.0    9796 
summary(flights$wind_speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   6.905  10.357  11.114  14.960  42.579    1634 
summary(flights$wind_gust)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  16.11   20.71   24.17   25.25   28.77   66.75  256391 
summary(flights$precip)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.00000 0.00000 0.00000 0.00456 0.00000 1.21000    1556 
summary(flights$pressure)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  983.8  1012.7  1017.5  1017.8  1022.8  1042.1   38788 
summary(flights$visib)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000  10.000  10.000   9.256  10.000  10.000    1556 

For key variables, I define them by the variables that can help us understand the weather phenomena. Therefore, I visualize 9 variables, including temperature, dewp, humid, wind speed, wind direction, wind gust, precip, pressure and visibility.

From the weather variables, we observed many missing values that make it difficult to fully determine the weather condition. In particular, wind gust and pressure have the most missing values. Therefore, we should be careful when handling missing data during the analysis to avoid bias or incorrect conclusions.

vars <- c("temp","dewp","humid","wind_dir","wind_speed","wind_gust","pressure", "precip")

for (v in vars) {
  p <- flights %>%
    select(all_of(v)) %>%
    filter(!is.na(.data[[v]])) %>%
    ggplot(aes(x = .data[[v]])) +
    geom_histogram(color = "white", fill = "#990000") +
    labs(
      title = paste("Distribution of", v),
      x = v, y = "Count"
    ) +
    theme_minimal(base_size = 13)
  
  print(p)
}
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

flights %>%
  filter(!is.na(visib)) %>%
  ggplot(aes(x = visib)) +
  geom_histogram(binwidth = 1, boundary = 0, color = "white", fill = "#990000") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  labs(title = "Distribution of visibility",
       x = "Visibility", y = "Count") +
  theme_minimal(base_size = 13)

flights %>%
  filter(!is.na(precip)) %>%
  ggplot(aes(y = precip)) +
  geom_boxplot(fill = "#990000", alpha = 0.7) +
  labs(title = "Boxplot of precipitation",
       y = "Precipitation", x = NULL) +
  theme_minimal(base_size = 13)

I add another box plot for precip to check the distribution.

Question 7

flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(delay_min = ifelse(dep_delay < 0, 0, dep_delay)) %>%
  group_by(year.x, month.x, day.x) %>%
  summarise(mean_delay = mean(delay_min), .groups = "drop") %>%
  arrange(desc(mean_delay)) 
# A tibble: 365 × 4
   year.x month.x day.x mean_delay
    <int>   <int> <int>      <dbl>
 1   2013       3     8       84.1
 2   2013       7     1       57.1
 3   2013       9     2       55.3
 4   2013       7    10       54.5
 5   2013      12     5       54.0
 6   2013       5    23       52.7
 7   2013       9    12       52.2
 8   2013       6    28       50.0
 9   2013       6    24       48.8
10   2013       7    22       48.4
# ℹ 355 more rows

Considering length of delay (early/ontime treated as 0), the worst average departure delay occurred on March 8, 2013, with an average delay of 84.1 minutes.

flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(delay_min = ifelse(dep_delay < 0, 0, dep_delay)) %>%
  group_by(origin, year.x, month.x, day.x) %>%
  summarise(mean_delay = mean(delay_min), .groups = "drop") %>%
  arrange(desc(mean_delay)) 
# A tibble: 1,095 × 5
   origin year.x month.x day.x mean_delay
   <chr>   <int>   <int> <int>      <dbl>
 1 LGA      2013       3     8      106. 
 2 EWR      2013       3     8       98.3
 3 LGA      2013       9     2       83.1
 4 LGA      2013      12     5       76.8
 5 JFK      2013       7    10       64.8
 6 LGA      2013       7     1       63.5
 7 EWR      2013      12     5       62.9
 8 EWR      2013       9    12       60.7
 9 JFK      2013       7     1       60.5
10 EWR      2013       5    23       59.6
# ℹ 1,085 more rows

By airport, the worst single day was LGA on March 8, 2013, with an average departure delay of 106.1 minutes.

flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(delay_min = ifelse(dep_delay < 0, 0, dep_delay)) %>%
  group_by(origin, year.x, month.x, day.x, hour.x) %>%
  summarise(mean_delay = mean(delay_min), .groups = "drop") %>%
  arrange(desc(mean_delay)) 
# A tibble: 19,434 × 6
   origin year.x month.x day.x hour.x mean_delay
   <chr>   <int>   <int> <int>  <dbl>      <dbl>
 1 LGA      2013       7    28     21       280.
 2 EWR      2013       2     9     10       269 
 3 EWR      2013       2     9      9       266 
 4 LGA      2013       9     2     16       250.
 5 LGA      2013       7    22     18       248.
 6 LGA      2013       7    28     19       240.
 7 JFK      2013       4    10     21       237 
 8 LGA      2013       9    12     20       226.
 9 EWR      2013       3     8     12       225.
10 LGA      2013      12     5     11       221.
# ℹ 19,424 more rows

The worst single hour was at LGA on July 28, 2013 at 21:00, with an average departure delay of 279.67 minutes.

Question 8

dest_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(delay_min = ifelse(dep_delay < 0, 0, dep_delay)) %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(delay_min), .groups = "drop")

airports_delay <- airports %>%
  inner_join(dest_delay, by = c("faa" = "dest"))
usa <- map_data("world") %>%
  filter(region %in% c("USA", "Puerto Rico"))

ggplot() +
  geom_polygon(data = usa,
               aes(long, lat, group = group),
               fill = "grey98", color = "grey85", linewidth = 0.2) +
  geom_point(data = airports_delay,
             aes(x = lon, y = lat, color = mean_delay),
             alpha = 0.85) +
  scale_color_viridis_c(option = "plasma", direction = 1,
                        name = "Avg dep delay (min)") +
  coord_quickmap(xlim = c(-170, -60), ylim = c(15, 72), expand = FALSE) +
  labs(title = "2013 Average Departure Delay by Destination Airport",
       x = NULL, y = NULL) +
  theme_void(base_size = 12) +
  theme(legend.position = "right",
        plot.title = element_text(margin = margin(l = 10), face = "bold"))

Question 9

flights_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(delay_min = pmax(dep_delay, 0),
         delay_flag = ifelse(dep_delay > 0, T, F))

# Distribution of Delay Minutes (of all flights)
summary(flights_delay$delay_min)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   15.39   11.00 1301.00 
# Distribution of Delay Minutes (of delayed flights)
summary(flights_delay$delay_min[flights_delay$delay_flag == T])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00   19.00   39.37   50.00 1301.00 

Scatter Plots of Delay vs Weather Phenomena

make_plot <- function(df, var, xlab = var) {
  df %>%
    filter(!is.na(.data[[var]])) %>%
    group_by(origin, time_hour) %>%
    summarise(mean_delay = mean(delay_min, na.rm = TRUE),
              xvar = first(.data[[var]]), .groups = "drop") %>%
    group_by(xvar) %>%
    summarise(mean_delay = mean(mean_delay, na.rm = TRUE), .groups = "drop") %>%
    ggplot(aes(x = xvar, y = mean_delay)) +
      geom_point(alpha = 0.6, size = 2) +
      geom_smooth(method = lm, se = TRUE, linewidth = 1) +
      labs(title = paste("Departure Delay vs.", xlab),
           x = xlab,
           y = "Average departure delay (minutes)") +
      theme_classic(base_size = 13) +
      theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
            legend.position = "top")
}

vars <- c("temp","dewp","humid","wind_speed","wind_gust","precip","pressure","visib")
labels <- c(temp = "Temperature (°F)",
            dewp = "Dew point (°F)",
            humid = "Humidity (%)",
            wind_speed = "Wind speed (mph)",
            wind_gust = "Wind gust (mph)",
            precip = "Precipitation (inches)",
            pressure = "Pressure (mb)",
            visib = "Visibility (miles)")

for (v in vars) {
  p <- make_plot(flights_delay, v, labels[[v]])
  print(p)
}
`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

Correlation

delay_corr <- flights_delay %>%
  group_by(origin, time_hour) %>%
  summarise(mean_delay = mean(dep_delay),
            across(all_of(vars), ~first(.x)),
            .groups = "drop") %>%
  select(mean_delay, all_of(vars)) %>%
  na.omit() %>%
  as.matrix()

head(delay_corr)
     mean_delay  temp  dewp humid wind_speed wind_gust precip pressure visib
[1,]  22.320000 37.04 19.94 49.62   13.80936  20.71404      0   1012.1    10
[2,]  13.714286 33.98 15.08 45.43   12.65858  25.31716      0   1014.1    10
[3,]  -2.444444 30.02 12.92 48.48   18.41248  26.46794      0   1016.0    10
[4,]  41.750000 28.94 12.02 48.69   18.41248  25.31716      0   1016.5    10
[5,]   7.600000 30.92 12.02 44.92   14.96014  25.31716      0   1017.5    10
[6,]  17.791667 32.00 12.92 44.74   12.65858  23.01560      0   1017.1    10
cor(delay_corr) %>%
  ggcorrplot(hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method="circle",
           colors = c("steelblue", "white", "orangered3"),
           title="Correlation",
           ggtheme=theme_bw)

Sensitivity analysis

make_plot_by_airport <- function(df, var, xlab = var) {
  df %>%
    filter(!is.na(.data[[var]])) %>%
    group_by(origin, time_hour) %>%
    summarise(mean_delay = mean(delay_min, na.rm = TRUE),
              xvar = first(.data[[var]]), .groups = "drop") %>%
    group_by(origin, xvar) %>%
    summarise(mean_delay = mean(mean_delay, na.rm = TRUE), .groups = "drop") %>%
    ggplot(aes(x = xvar, y = mean_delay)) +
      geom_point(alpha = 0.6, size = 2, , color = "darksalmon") +
      geom_smooth(method = lm, se = TRUE, linewidth = 1, , color = "gold") +
      facet_wrap(~ origin, nrow = 1) +
      labs(title = paste("Departure Delay vs.", xlab, "(by Origin Airport)"),
           x = xlab,
           y = "Average departure delay (minutes)") +
      theme_classic(base_size = 12) +
      theme(
        plot.title = element_text(face = "bold", margin = margin(b = 6)),
        plot.subtitle = element_text(margin = margin(b = 10)),
        strip.background = element_rect(fill = "grey95", color = NA),
        strip.text = element_text(face = "bold"),
        plot.caption = element_text(color = "grey40"),
        axis.title.x = element_text(margin = margin(t = 6)),
        axis.title.y = element_text(margin = margin(r = 6))
      )
}

for (v in vars) {
  p <- make_plot_by_airport(flights_delay, v, labels[[v]])
  print(p)
}
`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

Summary Table

flights_delay %>%
  group_by(delay_flag) %>%
  summarise()
# A tibble: 2 × 1
  delay_flag
  <lgl>     
1 FALSE     
2 TRUE      
qs <- quantile(flights_delay$dep_delay[flights_delay$delay_flag == TRUE], 
               probs = c(.25, .50, .75), na.rm = TRUE)

flights_delay <- flights_delay %>%
  mutate(delay_gp = cut(
    delay_min,
    breaks = c(-Inf, qs[1], qs[2], qs[3], Inf),
    labels = c("min–Q1", "Q1–Q2", "Q2–Q3", "Q3–max"),
    include.lowest = TRUE
  ))

flights_delay %>%
  filter(delay_flag == T) %>%
  group_by(delay_gp) %>%
  summarise(
    n_flights = n(),
    mean_delay = mean(delay_min, na.rm = TRUE),
    across(all_of(vars), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    # optional NA counts per variable:
    # across(all_of(vars), ~ sum(is.na(.x)), .names = "na_{.col}"),
    .groups = "drop"
  )
# A tibble: 4 × 11
  delay_gp n_flights mean_delay mean_temp mean_dewp mean_humid mean_wind_speed
  <fct>        <int>      <dbl>     <dbl>     <dbl>      <dbl>           <dbl>
1 min–Q1       32776       3.08      57.0      41.2       58.6            11.1
2 Q1–Q2        32319      12.2       57.3      41.5       58.9            11.4
3 Q2–Q3        31264      32.3       57.9      43.0       60.9            11.7
4 Q3–max       32073     111.        59.9      46.8       65.6            11.7
# ℹ 4 more variables: mean_wind_gust <dbl>, mean_precip <dbl>,
#   mean_pressure <dbl>, mean_visib <dbl>

Summary

Method: Weather is recorded at the origin–hour level, so many flights share the same weather record (clustering). To keep plots interpretable, I first averaged departure delays by origin × hour, then, for each weather variable, averaged again by that variable’s value; each point therefore represents the mean delay (min) at a given weather level.

Results: From the scatter plots and the correlation heat map, humidity shows the strongest positive association with mean departure delay (r ≈ +0.24), followed by dew point (≈ +0.19). Visibility is negatively related to delay (≈ −0.18), consistent with more delays under low-visibility conditions. Precipitation and wind gusts are also associated with higher delays, though with greater variability. Also, to make sure the results align, I repeated the plots by origin (JFK/LGA/EWR) for sensitivity analysis. The overall ranking was similar, although wind gusts and precipitation differed most across airports—suggesting airport-specific effects.

Conclusion: Descriptively, high humidity / high dew point and low visibility are most strongly linked to larger departure delays, with precipitation and gusts contributing in an airport-dependent way.

Limitations: Because multiple flights share the same hourly weather, correlations may overstate precision, and several weather variables are highly collinear (e.g., temp–dew point; wind speed–gusts). A mixed-effects model with random effects for origin and time would better adjust for clustering and quantify effect sizes.